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Abstract 



The Gaussian process (GP) is a popular way to specify dependencies be- 
tween random variables in a probabilistic model. In the Bayesian framework 
the covariance structure can be specified using unknown hyperparameters. 
Integrating over these hyperparameters considers different possible expla- 
nations for the data when making predictions. This integration is often per- 
formed using Markov chain Monte Carlo (MCMC) sampling. However, with 
non-Gaussian observations standard hyperparameter sampling approaches 
require careful tuning and may converge slowly. In this paper we present 
a slice sampling approach that requires little tuning while mixing well in 
both strong- and weak-data regimes. 



1 Introduction 

Many probabilistic models incorporate multivariate Gaussian distributions to explain de- 
pendencies between variables. Gaussian process (GP) models and generalized linear mixed 
models are common examples. For non-Gaussian observation models, inferring the parame- 
ters that specify the covariance structure can be difficult. Existing computational methods 
can be split into two complementary classes: deterministic approximations and Monte Carlo 
simulation. This work presents a method to make the sampling approach easier to apply. 

In recent work Murray et al. ^] developed a slice sampling [5] variant, elliptical slice sam- 
pling, for updating strongly coupled a-priori Gaussian variates given non-Gaussian obser- 
vations. Previously, Agarwal and Gelfand [3j demonstrated the utility of slice sampling for 
updating covariance parameters, conventionally called hyperparameters, with a Gaussian 
observation model, and questioned the possibility of slice sampling in more general settings. 
In this work we develop a new slice sampler for updating covariance hyperparameters. Our 
method uses a robust representation that should work well on a wide variety of problems, 
has very few technical requirements, little need for tuning and so should be easy to apply. 

1.1 Latent Gaussian models 

We consider generative models of data that depend on a vector of latent variables f that are 
Gaussian distributed with covariance Eg set by unknown hyperparameters 6. These models 
are common in the machine learning Gaussian process literature [e.g.H] and throughout the 
statistical sciences. We use standard notation for a Gaussian distribution with mean m and 
covariance E, 



and use f ^ Af{m., E) to indicate that f is drawn from a distribution with the density in (IT] 



A^(f;m,E) = |27rEr'/' exp (-i(f-m)Ts-i(f-m)) , 
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Figure 1: (a) Sho-ws draw s from the prior over f using three different lengthscales in the squared 
exponential covariance ([2|. |(b)| Shows the posteriors over log-lengthscale for these three draws. 



The generic form of the generative models we consider is summarized by 

covariance hyperparameters 9 ^ ph, 

latent variables f ^ A/'(0, Se), 

and a conditional likelihood P(data|f) = >C(f). 

The methods discussed in this paper apply to covariances Eg that are arbitrary positive 
definite functions parameterized by 9. However, our experiments focus on the popular case 
where the covariance is associated with N input vectors {'x.n]n=i through the squared- 
exponential kernel, 

(E,)., = fc(x„ X,) = o) cxp(- lY.tr^^^^^) , (2) 

with hyperparameters 9 — {a'j, {^d}}- Here cr^ is the 'signal variance' controlling the overall 
scale of the latent variables f . The give characteristic lengthscales for converting the 
distances between inputs into covariances between the corresponding latent values f . 

For non-Gaussian likelihoods we wish to sample from the joint posterior over unknowns, 

P(f , 9 1 data) = i £(f ) Ar(f ; 0, E^) Ph[9) . (3) 

We would like to avoid implementing new code or tuning algorithms for different covariances 
Eg and conditional likelihood functions C{{). 

2 Markov chain inference 

A Markov chain transition operator T{z' ^ z) defines a conditional distribution on a new 
position z' given an initial position z. The operator is said to leave a target distribution tt 
invariant if tt{z') = jT{z' z)Tr(z) dz. A standard way to sample from the joint poste- 
rior ([3| is to alternately simulate transition operators that leave its conditionals, P(f | data, 9) 
and P{9\i), invariant. Under fairly mild conditions the Markov chain will equilibrate to- 
wards the target distribution [e.g. E]. 

Recent work has focused on transition operators for updating the latent variables f given 
data and a fixed covariance Eg Updates to the hyperparameters for fixed latent 
variables f need to leave the conditional posterior, 

F(0|f)cxAA(f;O,E0)p„(0), (4) 

invariant. The simplest algorithm for this is the Metropolis-Hastings operator, see Algo- 
rithm [l] Other possibilities include slice sampling 2 and Hamiltonian Monte Carlo [7, 8^. 

Alternately fixing the unknowns f and 9 is appealing from an implementation standpoint. 
However, the resulting Markov chain can be very slow in exploring the joint posterior distri- 
bution. Figure [Ta| shows latent vector samples using squared-exponential covariances with 
different lengthscales. These samples are highly informative about the lengthscale hyperpa- 



rameter that was used, especially for short lengthscales. The sharpness of P{9 \ f ), Figure lb 



dramatically limits the amount that any Markov chain can update the hyperparameters 9 
for fixed latent values f . 
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Algorithm 1 M-H transition for fixed f 

Input: Current f and liyperparameters 6\ 

proposal dist. q\ covariance function Sq. 
Output: Next liyperparameters 



Propose: B' ~ g(6»' ; B) 
Draw u ~ Uniform(0, 1) 

:f „ ^ Af(f;0,Se,)p4fl')g(9;9') 
11 u ^ Ar(f;0,E8)p4e)g(e';S) 

return 6' > Accept new state 

else 

return 6 t> Keep current state 



Algorithm 2 M-H transition for fixed i> 

Input: Current state B, f; proposal dist. q; 

covariance function Sq; likelihood C{). 
Output: Next 6, f 
1: Solve for A"(0, 1) variate: i' = L^H 
Propose B' ~ q{8' ; B) 
Compute implied values: = Ly,^,u 
Draw u ~ Uniform(0, 1) 
if „ ^ c(t')pn(e')q(e;e') 
" " ^ £(f)Pft(e)<j(e';e) 

return f D> Accept new state 

else 

return 6, t > Keep current state 



2.1 Whitening the prior 

Often the conditional likelihood is quite weak; this is why strong prior smoothing as- 
sumptions are often introduced in latent Gaussian models. In the extreme limit in 
which there is no data, i.e. C is constant, the target distribution is the prior model, 
P{i,0) = J\f{{;0,T,g) pii{6). Sampling from the prior should be easy, but alternately fix- 
ing f and 9 does not work well because they are strongly coupled. One strategy is to 
reparameterize the model so that the unknown variables are independent under the prior. 

Independent random variables can be identified from a commonly-used generative procedure 
for the multivariate Gaussian distribution. A vector of independent normals, v, is drawn 
independently of the hyperparameters and then deterministically transformed: 

iy-AA(0,I), i = L^,u, where Ls^LjT^-Se. (5) 

Notation: Throughout this paper Lc will be any user-chosen square root of covariance 
matrix C. While any matrix square root can be used, the lower-diagonal Cholesky decom- 
position is often the most convenient. We would reserve C^/^ for the principal square root, 
because other square roots do not behave like powers: for example, chol(C)~^ ^ chol(C~^). 

We can choose to update the hyperparameters 6 for fixed u instead of fixed f . As the 
original latent variables f are deterministically linked to the hyperparameters 6^ in ([5]), these 
updates will actually change both 6 and f . The samples in Figure [Ta| resulted from using 
the same whitened variable u with different hyperparameters. They follow the same general 
trend, but vary over the lengthscales used to construct them. 

The posterior over hyperparameters for fixed u is apparent by applying Bayes rule to the 
generative procedure in ([5]), or one can laboriously obtain it by changing variables in (|3]): 

^(6*11^, data) cx P(6i, iv, data) = P(6i, f = LE,i/, data) |LsJ oc • • • oc C{i{0,iy)) ph{e). (6) 

Algorithm [2] is the Metropolis-Hastings operator for this distribution. The acceptance rule 
now depends on the latent variables through the conditional likelihood £{{) instead of the 
prior J\f{{; 0, Sg) and these variables are automatically updated to respect the prior. In the 
no-data limit, new hyperparameters proposed from the prior are always accepted. 



3 Surrogate data model 



Neither of the previous two algorithms are ideal for statistical applications, which is illus- 
trated in Figure [2] Algorithm [2] is ideal in the "weak data" limit where the latent variables f 
are distributed according to the prior. In the example, the likelihoods are too restrictive for 
Algorithm [2]s proposal to be acceptable. In the "strong data" limit, where the latent vari- 
ables f are fixed by the likelihood C, Algorithm [l] would be ideal. However, the likelihood 
terms in the example are not so strong that the prior can be ignored. 

For regression problems with Gaussian noise the latent variables can be marginalised out an- 
alytically, allowing hyperparameters to be accepted or rejected according to their marginal 
posterior P(0|data). If latent variables are required they can be sampled directly from 
the conditional posterior P(f 1 61, data). To biuld a method that applies to non-Gaussian 
likelihoods, we create an auxiliary variable model that introduces surrogate Gaussian ob- 
servations that will guide joint proposals of the hyperparameters and latent variables. 
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Figure 2: A regression problem with Gaussian observations illustrated by 2a gray bars. The 
current state of the sampler has a short lengthscale hyperparameter (^ = 0.3); a longer lengthscale 
(^=1.5) is being proposed. The current latent variables do not lie o n a straight enough line for the 
long lengthscale to be plausible. Whitening the prior (Section |2.1[ | updates the latent variables to 
a straighter line, but ignores the observations. A proposal using surrogate data (Section|3] with Se 
set to the observation noise) sets the latent variables to a draw that is plausible for the proposed 
lengthscale while being close to the current state. 




We augment the latent Gaussian model with auxihary variables, g, a noisy version of the 
true latent variables: 

P{g\f,e)^U{g;{,Se). (7) 

For now Se is an arbitrary free parameter that could be set by hand to either a fixed 
value or a value that depends on the current hyperparamete rs 9 . We will discuss how to 
automatically set the auxiliary noise covariance Sg in Section [3.2[ 

The original model, f ~ Af{0,T,g) and Q define a joint auxiliary distribution P{i,g\9) 
given the hyperparameters. It is possible to sample from this distribution in the opposite 
order, by first drawing the auxiliary values from their marginal distribution 

P{g\e)^Af{g;0,j:e + Se), (8) 

and then sampling the model's latent values conditioned on the auxiliary values from 

P{f\g,9) =JV{i; niftg, where some standard manipulations give: 

Ro = {^g^ + Sg^) ^ = T,g — 'Eg{'E0 + Sg) ^T,g = Sg — Sg{Sg + I^g) ^Sg, 

nie,g^^ei^g + Se)-'g^ RgSg^g. (9) 

That is, under the auxiliary model the latent variables of interest are drawn from their 
posterior given the surrogate data g. Again we can describe the sampling process via a 
draw from a spherical Gaussian: 

T] ^ 7V(0,I), f = LjigT] + me,g, where Lr^lI,^ =Rg- (10) 

We then condition on the "whitened" variables rj and the surrogate data g while updating 
the hyperparameters 9. The implied latent variables f (0, ry, g) will remain a plausible draw 
from the surrogate posterior for the current hyperparameters. This is illustrated in Figure[2] 

We can leave the joint distribution ([s]) invariant by updating the following conditional 
distribution derived from the above generative model: 

P{9 1 ri, g, data) cx P(0, r?, g, data) cc £(£(0, r?, g)) Af{g; 0, J:e + Se) Phi9). (11) 

The Metropolis-Hastings Algorithm |3] contains a ratio of these terms in the acceptance rule. 

3.1 Slice sampling 

The Metropolis-Hastings algorithms discussed so far have a proposal distribution q{9'; 9) 
that must be set and tuned. The efficiency of the algorithms depend crucially on careful 
choice of the scale a of the proposal distribution. Slice sampling j] is a family of adaptive 
search procedures that are much more robust to the choice of scale parameter. 
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Algorithm 3 Surrogate data M-H Algorithm 4 Surrogate data slice sampling 



Input: 0, f; prop. dist. q: model of Sec. [3] Input: 6, f; scale a; model of Sec. [3] 

Output: Next 6, f Output: Next f, 6 
1: Draw surrogate data: g~A''{f, 5e) 1: Draw surrogate data: g~A''(f, Se) 

2: Compute implied latent variates: 2: Compute implied latent variates: 



'n = Li^l (f - mftg) r? = Lj^l (f - mg.g) 

Propose d' ~ q(B' : d) 3: Randomly center a bracket: 
Compute function f' = Lfl^,jj + mg'.g i) Uniform(0, cr), 6^,„=e-v, £'max = ^'min + a- 

Draw u ~ Uniform(0, 1) 4: Draw u ~ Uniform(0, 1) 

■f £(f') A^(g;0,S;)/-hSa,) pft(9') g(fl ; 9') 5: Determine threshold: 

£(f)A'(g;0,E<,+S<,)p,(fl)q(fl';e) J/ = 'U,C(f)Ar(g;0,S9 + S8)pfc(e) 



return (?', f > Accept new state 

else 

return 6, f t> Keep current state 



Draw proposal: 6' ~ Uiiiform((9min! ^max) 
Compute function i' = Lr ,7) + mgi g 
if C{f')M{^A),T.g,+Sg.)p,ie') > y 

return f , 6' 
else if 6' <e 

Shrink bracket minimum: drr,\n = 8' 
else 

Shrink bracket maximum: ^?max = 
goto [6] 



Algorithm [4] applies one possible slice sampling algorithm to a scalar hyper parameter 6 in 
the surrogate data model of this section. It has a free parameter tr, the scale of the initial 
proposal distribution. However, careful tuning of this parameter is not required. If the initial 
scale is set to a large value, such as the width of the prior, then the width of the proposals will 
shrink to an acceptable range exponentially quickly. Stepping-out procedures [2] could be 
used to adapt initial scales that are too small. We assume that axis-aligned hyperparameter 
moves will be effective, although reparameterizations could improve performance [e.g. i9j. 

3.2 The auxiliary noise covariance Sg 

The surrogate data g and noise covariance Sg define a pseudo-posterior distribution that 
softly specifies a plausible region within which the latent variables f are updated. The noise 
covariance determines the size of this region. The first two baseline algorithms of Section [2] 
result from limiting cases of Se = ol: 1) if a = the surrogate data and the current latent 
variables are equal and the acceptance ratio reduces to that of Algorithm [ij 2) as a — ?► oo 
the observations arc uninformative about the current state and the pseudo-posterior tends 
to the prior. In the limit, the acceptance ratio reduces to that of Algorithm [2] One could 
choose a based on preliminary runs, but such tuning would be burdensome. 

For likelihood terms that factorize, £(f) — Ci{fi), we can measure how much the likeli- 
hood restricts each variable individually: 

P(/, I A, 0) (X C,if,) AA(/.; 0, i^eU). (12) 

A Gaussian can be fitted by moment matching or a Laplace approximation (matching sec- 
ond derivatives at the mode). Such fits, or close approximations, are often possible analyti- 
cally and can always be performed numerically as the distribution is only one-dimensional. 



Given a Gaussian fit to the site-posterior (12) with variance Vi, we can set the auxil 



iary noise to a level that would result in the same posterior variance at that site alone: 
{Se)ii = — {T,g)ii~^)~^ . (Any negative {Sg)ii must be thresholded.) The moment match- 
ing procedure is a grossly simplified first step of "assumed density filtering" or "expectation 
propagation" |10j . which are too expensive for our use in the inner-loop of a Markov chain. 

4 Related work 

We have discussed samplers that jointly update strongly-coupled latent variables and hy- 
perparameters. The hyperparame ters can move further in joint moves than their narrow 



conditional posteriors (e.g.. Figure lb) would allow. A generic way of jointly sampling real- 
valued variables is Hamiltonian/Hybrid Monte Carlo (HMC) [7JIH]- However, this method 
is cumbersome to implement and tune, and using HMC to jointly update latent variables 
and hyperparameters in hierarchical models does not itself seem to improve sampling |Ilj. 

Christensen et al. have also proposed a robust representation for sampling in latent 
Gaussian models. They use an approximation to the target posterior distribution to con- 
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struct a reparameterization where the unknown variables are close to independent. The 
approximation replaces the likelihood with a Gaussian form proportional to A/'(f ; f, A(f)): 

f = argmaxf £(£), A„-(f) = , (13) 

where A is often diagonal, or it was suggested one would only take the diagonal part. 
This Taylor approximation looks like a Laplace approximation, except that the likelihood 
function is not a probability density in f. This likelihood fit results in an approximate 
Gaussian posterior J\f{{;m.g^^^,Rg) as found in Q, with noise Sg—A({)~^ and data g = f. 

Thinking of the current latent variables as a draw from this approximate posterior, 
ci;~'7V(0, 1), f = L/jg u; + nig j. , suggests using the reparameterization w = L^J^(f — nig j). 
We can then fix the new variables and update the hyperparameters under 

P{0 1 u;, data) cx £(£ (u;, 0)) AA(f (a;, 0); 0, Se) Phid) \LrJ . (14) 

When the likelihood is Gaussian, the reparameterized variables a; are independent of each 
other and the hyperparameters. The hope is that approximating non-Gaussian likelihoods 
will result in nearly-independent parameterizations on which Markov chains will mix rapidly. 

Taylor expanding some common log-likelihoods around the maximum is not well defined, 
for example approximating probit or logistic likelihoods for binary classification, or Pois- 
son observations with zero counts. These Taylor expansions could be seen as giving flat or 
undefined Gaussian approximations that do not reweight the prior. When all of the like- 



lihood terms are flat the reparameterization approach reduces to that of Section 2.1 The 
alternative Sg auxiliary covariances that we have proposed could be used instead. 

The surrogate data samplers of Section [3] can also be viewed as using reparameterizations, 
by treating r] = L]^^{i 

~ rnftg) ^ arbitrary random reparameterization for making pro- 
posals. A proposal density (?(rj', 9'; rj, 0) in the reparameterized space must be multiplied by 
the Jacobian \L]^^^ \ to give a proposal density in the original parameterization. The proba- 
bility of proposing the reparameterization must also be included in the Metropolis-Hastings 
acceptance probability: 

. / P{9' X \d!,t3,)-P{s\i' ,Sg,)-q(e;e')\L„^\ \ 
™^ ' P(f?,f|data)P(g|f,Se)-g(e';e)|L^^ij J' 

A few lines of linear algebra confirms that, as it mu st do, the same acceptance ratio results 
as before. Alternatively, substituting (Is]) into ( 15 ) shows that the acceptance probability 
is very similar to that obtained by applying Metropolis-Hastings to 
Christensen et al. [9 . The differences are that the new latent variabl 
using different pseudo-posterior means and the surrogate data method has an extra term 
for the random, rather than fixed, choice of reparameterization. 

The surrogate data sampler is easier to implement than the previous reparameterization 
work because the surrogate posterior is centred around the current latent variables. This 
means that 1) no point estimate, such as the maximum likelihood f , is required. 2) picking 
the noise covariance Sg poorly may still produce a workable method, whereas a fixed repa- 
rameterized can work badly if the true posterior distribution is in the tails of the Gaussian 
approximation. Christensen et al. [9] pointed out that centering the approximate Gaus- 
sian likelihood in their reparameterization around the current state is tempting, but that 
computing the Jacobian of the transformation is then intractable. By construction, the 
surrogate data model centers the reparameterization near to the current state. 



14) as proposed by 



es f ' are computed 



5 Experiments 

We empirically compare the performance of the various approaches to GP hyperparameter 
sampling on four data sets: one regression, one classification, and two Cox process inference 
problems. Further details are in the rest of this section, with full code as supplementary 
material. The results are summarized in Figure [3] followed by a discussion section. 
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In each of the experimental configurations, we ran ten independent chains with different 
random seeds, burning in for 1000 iterations and samphng for 5000 iterations. We quantify 
the mixing of the chain by estimating the effective number of samples of the complete 
data likelihood trace using R-CODA [T^], and compare that with three cost metrics: the 
number of hyperparameter settings considered (each requiring a small number of covariance 
decompositions with O(n^) time complexity), the number of likelihood evaluations, and the 
total elapsed time on a single core of an Intel Xeon SGHz CPU. 

The experiments are designed to test the mixing of hyperparameters 9 while sampling from 
the joint posterior (|3|. All of the discussed approaches except Algorithm [T] update the latent 
variables f as a side-effect. However, further transition operators for the latent variables for 
fixed hyperparameters are required. In Algorithm[2]the "whitened" variables v remain fixed; 
the latent variables and hyperparameters are constrained to satisfy i = L^gi'. The surrogate 
data samplers are ergodic: the full joint posterior distribution will eventually be explored. 
However, each update changes the hyperparameters and requires expensive computations 
involving covariances. After computing the covariances for one set of hyperparameters, it 
makes sense to apply several cheap updates to the latent variables. For every method we 
applied ten updates of elliptical slice sampling [1] to the latent variables f between each 
hyperparameter update. One could also consider applying elliptical slice sampling to a 
reparamcterized representation, for simplicity of comparison we do not. Independently of 
our work Titsias |13j has used surrogate data like reparameterizations to update latent 
variables for fixed hyperparameters. 



Methods We implemented six methods for updating Gaussian covariance hyperparame- 
ters. Each method used the same slice sampler, as in Algorithm |4j applied to the following 
model representations, fixed: fixing the latent function f [Mj. prior- white: whitening 
with the prior, surr-site: using surrogate data with the noise level set to match the site 
We used Laplace approximations for the Poisson likelihood. 



posterior ( 
cation pro 
well [T5 



12 1. We used Laplace approximations for the Poisson likelihood. For classifi- 
Slems we used moment matching, because Laplace approximations do not work 



surr-taylor: using surrogate data with noise variance set via Taylor expansion of 
the log-likelihood (13 1. Infinite variances were truncated to a large value, post-taylor and 
post-site: as f or t he surr- methods but a fixed reparameterization based on a posterior 
approximation ( 14 1 . 



Binary Classification (Ionosphere) We evaluated four different methods for perform- 
ing binary GP classification: fixed, prior-white, surr-site and post-site. We applied 
these methods to the Ionosphere dataset [16^, using 200 training data and 34 dimensions. 
We used a logistic likelihood with zero-mean prior, inferring lengthscales as well as sig- 
nal variance. The -taylor methods reduce to other methods or don't apply because the 
maximum of the log-likelihood is at plus or minus infinity. 



Gaussian Regression (Synthetic) When the observations have Gaussian noise the 
post-taylor reparameterization of Christensen et al. [iOI makes the hyperparameters and 
latent variables exactly independent. The random centering of the surrogate data model will 
be less effective. We used a Gaussian regression problem to assess how much worse the sur- 
rogate data method is compared to an ideal reparameterization. The synthetic data set had 
200 input points in 10-D drawn uniformly within a unit hypercube. The GP had zero mean, 
unit signal variance and its ten lengthscales in ^ drawn from Uniform(0, VlO). Observation 
noise had variance 0.09. We applied the fixed, prior-white, surr-site/surr-taylor, 
and post-site/post-taylor methods. For Gaussian likelihoods the -site and -taylor 
methods coincide: the auxiliary noise matches the observation noise {Sg = 0.091). 



Cox process inference We tested all six methods on an inhomogeneous Poisson process 
with a Gaussian process prior for the log-rate. We sampled the hyperparameters in (pi) and 
a mean offset to the log-rate. The model was applied to two point process datasets: 1) a 
record of mining disasters [17] with 191 events in 112 bins of 365 days. 2) 195 redwood tree 
locations in a region scaled to the unit square [IB] split into 25x25 = 625 bins. The results 
for the mining problem were initially highly variable. As the mining experiments were also 
the quickest we re-ran each chain for 20,000 iterations. 
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ionosphere synthetic mining redwoods ionosphere synthetic mining redwoods ionosphere synthetic mining redwoods 
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Figure 3: The results of experimental comparisons of six MCMC methods for GP hyperparameter 
inference on four data sets. Each figure shows four groups of bars (one for each experiment) and the 
vertical axis shows the effective number of samples of the complete data likelihood per unit cost. 
The costs are per likelihood evaluation (left), per covariance construction (center), and per second 
(right). Means and standard errors for 10 runs are shown. Each group of bars has been rescaled for 
readability: the number beneath each group gives the effective samples for the surr-site method, 
which always has bars of height 1. Bars are missing where methods are inapplicable (see text). 



6 Discussion 

On the Ionosphere classification problem both of the -site methods worked much better 
than the two baselines. We slightly prefer surr-site as it involves less problem-specific 
derivations than post-site. 

On the synthetic test the post- and surr- methods perform very similarly. We had expected 
the existing post- method to have an advantage of perhaps up to 2-3x, but that was not 
realized on this particular dataset. The post- methods had a slight time advantage, but 
this is down to implementation details and is not notable. 

On the mining problem the Poisson likelihoods are often close to Gaussian, so the exist- 
ing post-taylor approximation works well, as do all of our new proposed methods. The 
Gaussian approximations to the Poisson likelihood fit most poorly to sites with zero counts. 
The redwood dataset discretizes two-dimensional space, leading to a large number of bins. 
The majority of these bins have zero counts, many more than the mining dataset. Taylor 
expanding the likelihood gives no likelihood contribution for bins with zero counts, so it 
is unsurprising that post-taylor performs similarly to prior-white. While surr-ta ylo r 



works better, the best results here come from using approximations to the site-posterior ( 12 1 
For unreasonably fine discretizations the results can be different again: the site- reparam 
eterizations do not always work well. 

Our empirical investigation used slice sampling because it is easy to implement and use. 
However, all of the representations we discuss could be combined with any other MCMC 
method, such as [19] recently used for Cox processes. The new surrogate data and post-site 
representations offer state-of-the-art performance and are the first such advanced methods 
to be applicable to Gaussian process classification. 

An important message from our results is that fixing the latent variables and updating 
hyperparameters according to the conditional posterior — as commonly used by GP practi- 
tioners — can work exceedingly poorly. Even the simple reparameterization of "whitening 
the prior" discussed in Section |2.1| works much better on problems where smoothness is 
important in the posterior. Even if site approximations are difficult and the more ad- 
vanced methods presented are inapplicable, the simple whitening reparameterization should 
be given serious consideration when performing MCMC inference of hyperparameters. 
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